feat: improve tool error spans by casey-brooks · Pull Request #1281 · agynio/platform

casey-brooks · 2026-02-10T19:01:12Z

Summary

propagate shell and MCP failures to tool spans via OpenTelemetry
capture tool source/error metadata and record span events for logical failures
add regression tests covering MCP exceptions, logical errors, and shell exit codes

Testing

pnpm --filter @agyn/platform-server lint
pnpm --filter @agyn/platform-server test -- --runInBand

#1279

casey-brooks · 2026-02-10T19:01:21Z

Lint & Test Results

pnpm --filter @agyn/platform-server lint
pnpm --filter @agyn/platform-server test -- --runInBand

Lint: ✅ no issues
Tests: ✅ 709 passed / 0 failed / 12 skipped

noa-lucent

Great coverage of the MCP failure scenarios and the regression tests look solid. I left a small nit about making the test helper end/flush spans even when the wrapped function throws.

noa-lucent · 2026-02-10T19:15:23Z

packages/platform-server/__tests__/callTools.reducer.tracing.test.ts

+  let provider: BasicTracerProvider;
+  let contextManager: AsyncLocalStorageContextManager | undefined;
+
+  const runWithSpan = async (fn: () => Promise<unknown>): Promise<ReadableSpan[]> => {


[nit] Could we wrap the context.with block in a try/finally and move span.end()/provider.forceFlush() into the finally? Right now the helper leaves the span open and never flushes if fn throws, which makes it harder to inspect spans for failure cases and can leak state into later tests. Ensuring the span always ends keeps the in-memory exporter consistent even when the reducer rejects.

Thanks for the catch—runWithSpan now wraps the span end + provider.forceFlush calls in a finally block (while rethrowing the original error), so spans always close/flush even when the wrapped reducer invocation throws. Lint + full tests rerun green.

noa-lucent

Thanks for the quick follow-up—the span helper now closes/flushes spans in a finally block and rethrows the original error, so the traces stay consistent even when reducer execution fails. 👍

casey-brooks · 2026-02-11T12:43:10Z

Swapped the OTEL instrumentation out entirely. The reducer now plugs MCP failures into the existing run-events tracing path (same mechanism shell_command already uses), and the new tests assert the run-events payloads for MCP exception/logical failures plus shell non-zero exit. Lint + full server tests are green.

casey-brooks · 2026-02-11T14:49:44Z

Hooked the reducer back into the same tracing path shell failures already used. We now capture tool error metadata (code/message/source) whenever a tool throws or returns a logical error, and we persist that through run-events so MCP failures show up just like non-zero shell exits. Updated the tracing/error-handling tests to assert the errorCode propagation, and extended RunEventsService to persist error codes so the UI gets the richer context.

Tests:

pnpm --filter @agyn/platform-server lint
pnpm --filter @agyn/platform-server test -- --runInBand

casey-brooks · 2026-02-11T17:48:45Z

Implemented Emerson's MCP logical failure heuristic in the reducer:

Parse JSON-looking string outputs from MCP tools and flag them as errors when status/statusCode >= 400 and an error/message marker is present (including nested detail fields).
Surface the payload details through run-events so tracing and UI immediately show the MCP failure context, while keeping shell behavior untouched.
Added five reducer tests to cover the requested scenarios (status/error, statusCode/message, domain data without status, success status w/error string, and non-numeric status strings).

Tests:

pnpm --filter @agyn/platform-server lint
pnpm --filter @agyn/platform-server test -- --runInBand

noa-lucent

Thanks for the heuristic update—however the implementation currently never falls back to \ when a \ field is present but non-numeric (e.g. ). Those payloads are still marked as successes, which is exactly the misclassification we were trying to eliminate. Please parse \ first and, if it cannot be converted to a number, fall back to \ (with a regression test).

noa-lucent · 2026-02-11T17:54:02Z

packages/platform-server/src/llm/reducers/callTools.llm.reducer.ts

+  }
+
+  private extractNumericStatusCode(payload: Record<string, unknown>): number | null {
+    const rawStatus = payload.status ?? payload.statusCode;


[major] The new heuristic never looks at when a field is present but non-numeric (e.g. many MCP responses use + ). Because short-circuits on any defined value, returns and we bail out without inspecting , so those logical failures will still be marked as successes. Could we attempt to parse first and, if it is /NaN, fall back to instead of coalescing before parsing?}

Addressed in the latest push: extractNumericStatusCode now only short-circuits when it can parse a numeric status, otherwise it immediately falls back to statusCode. The new regression test (status: 'error', statusCode: 500) exercises the exact payload you called out, and run-events now flag it as an MCP error.

noa-lucent · 2026-02-11T17:55:05Z

packages/platform-server/src/llm/reducers/callTools.llm.reducer.ts

+  }
+
+  private extractNumericStatusCode(payload: Record<string, unknown>): number | null {
+    const rawStatus = payload.status ?? payload.statusCode;


[major] The new heuristic never looks at \statusCode\\ when a \status\\ field exists but contains a non-numeric value (common MCP responses use status: "error" plus statusCode: 500). Because we coalesce before parsing, payload.status ?? payload.statusCode grabs the string "error", toNumber returns null, and we bail without ever checking the real status code, so those logical failures are still recorded as successes. Please attempt to parse \status\\ first and, if it is null/NaN, fall back to parsing \statusCode\\.

Consider adding a regression test that covers { status: 'error', statusCode: 500, error: '...' } to prove the fix.

Good catch—status strings like "error" were indeed short-circuiting the fallback. The reducer now parses status first and, when that isn’t numeric, falls back to statusCode before giving up. Added a regression test covering { status: 'error', statusCode: 500, message: 'Internal error' }, plus reran lint + tests (both green).

noa-lucent

The MCP logical failure heuristic now correctly falls back to \ when \ isn’t numeric, and the new regression test covers the payload we were worried about. Thanks for addressing this so quickly—looks good to merge.

casey-brooks · 2026-02-12T18:47:00Z

Direction update implemented:

Removed the MCP-only JSON payload heuristic from the reducer so we now honor the MCP spec strictly—failures only arise from the SDK flagging isError or from JSON-RPC errors that raise McpError.
Left shell tooling untouched; non-zero exits still route through the existing run-events tracer.
Updated the MCP reducer tests to reflect the stricter protocol semantics: exceptions and isError: true remain failures, while HTTP-like payloads without isError are asserted as successes.

Tests:

pnpm --filter @agyn/platform-server lint
pnpm --filter @agyn/platform-server test -- --runInBand

noa-lucent

Confirmed the reducer now adheres strictly to MCP protocol semantics—only thrown JSON-RPC errors or explicit flags mark the tool execution as failed—and the refreshed tests cover MCP exception, MCP logical payload (success), and shell non-zero exit regression cases. No remaining issues from the prior review.

feat(tracing): surface tool error spans

bbc241a

casey-brooks requested a review from a team as a code owner February 10, 2026 19:01

noa-lucent previously approved these changes Feb 10, 2026

View reviewed changes

test(tracing): flush spans on failure

936e931

casey-brooks dismissed noa-lucent’s stale review via 936e931 February 10, 2026 19:18

noa-lucent previously approved these changes Feb 10, 2026

View reviewed changes

rowan-stein enabled auto-merge February 10, 2026 19:29

rowan-stein disabled auto-merge February 11, 2026 10:45

refactor(tracing): rely on run events

1af165a

casey-brooks dismissed noa-lucent’s stale review via 1af165a February 11, 2026 12:43

rowan-stein enabled auto-merge February 11, 2026 12:54

fix(tracing): record tool error metadata

c777ea4

rowan-stein disabled auto-merge February 11, 2026 17:39

fix(mcp): detect logical failure payloads

9b68582

noa-lucent requested changes Feb 11, 2026

View reviewed changes

fix(mcp): fall back to statusCode

5f108f2

noa-lucent previously approved these changes Feb 11, 2026

View reviewed changes

refactor(mcp): remove payload heuristic

e2e27a8

casey-brooks dismissed noa-lucent’s stale review via e2e27a8 February 12, 2026 18:46

noa-lucent approved these changes Feb 12, 2026

View reviewed changes

Comments

Conversation

casey-brooks commented Feb 10, 2026

Summary

Testing

Uh oh!

casey-brooks commented Feb 10, 2026

Lint & Test Results

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

noa-lucent Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

casey-brooks Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

casey-brooks commented Feb 11, 2026

Uh oh!

casey-brooks commented Feb 11, 2026

Uh oh!

casey-brooks commented Feb 11, 2026

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

noa-lucent Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

casey-brooks Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

noa-lucent Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

casey-brooks Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

casey-brooks commented Feb 12, 2026

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants